
Main #5

Merged
jgibson2 merged 18 commits into polycam from main on Apr 24, 2026

Conversation

@jgibson2
Collaborator

Summary


Test plan


xingguo01 and others added 18 commits April 23, 2026 18:08
…18767)

- Add support for quantized clamp-type activations in the Cortex-M
pipeline by canonicalizing relu/hardtanh/clamp to quantized
aten.clamp.default for standalone int8 paths
- Extend activation fusion to cover max_pool2d.
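Conceptually, the canonicalization can be sketched as a small rewrite table (hypothetical helper names; the real pass operates on quantized graph nodes):

```python
# Hedged sketch of clamp-type canonicalization (hypothetical names).
# relu(x) == clamp(x, min=0) and hardtanh(x, lo, hi) == clamp(x, lo, hi),
# so a single quantized clamp fusion path can serve all three activations.
def canonicalize_activation(op_name, args):
    if op_name == "aten.relu.default":
        (x,) = args
        return "aten.clamp.default", (x, 0.0, None)
    if op_name == "aten.hardtanh.default":
        x, lo, hi = args
        return "aten.clamp.default", (x, lo, hi)
    return op_name, args  # aten.clamp.default and other ops pass through
```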

@freddan80 @per @zingo @oscarandersson8218 @digantdesai
@Sebastian-Larsson @AdrianLundell @psiddh

cc @digantdesai @freddan80 @per @zingo @oscarandersson8218 @mansnils
@Sebastian-Larsson @robell

Signed-off-by: Xingguo Li <xingguo.li@arm.com>
…ch#18971)

FuseConstantArgsPass resolved input_qparams by flattened input-node
index, while FoldAndAnnotateQParamsPass stores them by top-level
argument index. For aten.cat with a list-valued tensor argument, this
caused only the first tensor to be dequantized before folding, which
corrupted the fused constant.

Resolve qparams by top-level argument index and propagate that qparam
through nested list and tuple arguments. Add a regression test for
quantized aten.cat constant folding with list-valued tensor inputs.
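The indexing fix can be illustrated with a minimal sketch (names and structures are hypothetical; the real passes work on FX nodes): qparams are keyed by the top-level argument index and then applied through nested list/tuple arguments, which is what a list-valued input like aten.cat's requires.

```python
# Hedged sketch: resolve qparams by top-level argument index, then
# propagate them through nested list/tuple arguments.
def map_arg(arg, fn):
    if isinstance(arg, (list, tuple)):
        return type(arg)(map_arg(a, fn) for a in arg)
    return fn(arg)

def dequantize_inputs(args, input_qparams, dequant):
    out = []
    for idx, arg in enumerate(args):
        qp = input_qparams.get(idx)  # keyed by top-level index, not flattened
        if qp is None:
            out.append(arg)
        else:
            out.append(map_arg(arg, lambda t: dequant(t, qp)))
    return out
```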

Signed-off-by: Per Held <per.held@arm.com>
Change-Id: I6e1a012d82a5dbeecb403c440a2944953dd5cba7
Fixes pytorch#10736

Formats `third-party/CMakeLists.txt` using `cmake-format` to improve
readability and consistency.

**Changes:**
- Reformatted `ExternalProject_Add(...)` blocks for `flatbuffers` and
`flatcc`
- Reflowed `set_target_properties(...)`, `set(...)` cache variables, and
`install(...)` calls
- No functional changes — formatting only
All 4 tests failed because they called forward() with zero arguments on
mobilenet_v2 which expects a [1,3,224,224] float input. This was a test
bug, not a runtime bug. Add a dummyInput() helper that creates a
Tensor.ones with the correct shape, and remove all @ignore annotations.

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Differential Revision: D101887672

Pull Request resolved: pytorch#19035
Differential Revision: D102189156

Pull Request resolved: pytorch#19077
…rch#19092)

Add cause-chaining constructor to ExecutorchRuntimeException so wrapped
exceptions preserve the original cause in the stack trace.

Restore detailed native error messages in LlmModule.load() — the null
runner case now reports the model_type_category and valid values instead
of a generic message. Load failures now throw from JNI with the specific
error code and description.

This commit was authored with the help of Claude.
…ytorch#18959)

Summary:

The CUDA runtime shims for sort operations use the Half (float16) dtype, but it was not defined in the slim ScalarType enum, causing compiler warnings treated as errors (-Werror=switch). This adds proper Half support to the slim ScalarType enum so switch statements can use the enum value directly instead of casting to the underlying type.

Differential Revision: D101218928
1. Attacker sets that flag on an external tensor.
2. XNNPACK thinks it owns the tensor and frees it inside the backend.
3. The ET runtime also frees it at method destruction.
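The defensive check implied by the log output below can be sketched as follows (the mask value and function name are assumptions for illustration; the real check lives in XNNCompiler.cpp):

```python
# Hedged sketch: reject serialized tensor flags outside a supported mask
# instead of trusting attacker-controlled flag bits.
SUPPORTED_FLAG_MASK = 0x000000FF  # illustrative assumption

def validate_tensor_flags(flags):
    unsupported = flags & ~SUPPORTED_FLAG_MASK
    if unsupported:
        return f"Tensor value has unsupported flag bits 0x{unsupported:08x}"
    return None  # flags are within the supported set
```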


Test Plan:
Build and run executor runner against problematic PTE file:
```
# Build executor runner:
cmake -B cmake-out \
    -DEXECUTORCH_BUILD_EXECUTOR_RUNNER=ON \
    -DEXECUTORCH_BUILD_XNNPACK=ON
cmake --build cmake-out -j16 --target executor_runner

# Output
(executorch) [lfq@devvm11764.nha0 /data/users/lfq/security/executorch (f9f29e7)]$ ./cmake-out/executor_runner --model_path=/data/users/lfq/security/executorch_repros/TOB-EXECUTORCH-44.pte   
```
Previous
```
(executorch) [lfq@devvm11764.nha0 /data/users/lfq/security/executorch (security44)]$ ./cmake-out/executor_runner --model_path=/data/users/lfq/security/executorch_repros/TOB-EXECUTORCH-44.pte     
Note (XNNPACK): l1_data_cache_bytes=32768, l1_data_cache_line_size=64, l1_data_cache_associativity=8, l1_data_cache_num_sets=64. (init_hardware_config, /data/users/lfq/security/executorch/backends/xnnpack/third-party/XNNPACK/src/configs/hardware-config.c:417)
Note (XNNPACK): l2_data_cache_bytes=1048576, l2_data_cache_line_size=64, l2_data_cache_associativity=8, l2_data_cache_num_sets=2048. (init_hardware_config, /data/users/lfq/security/executorch/backends/xnnpack/third-party/XNNPACK/src/configs/hardware-config.c:436)
I 00:00:00.002612 executorch:cpuinfo_utils.cpp:71] Reading file /sys/devices/soc0/image_version
I 00:00:00.002640 executorch:cpuinfo_utils.cpp:87] Failed to open midr file /sys/devices/soc0/image_version
I 00:00:00.002657 executorch:cpuinfo_utils.cpp:100] Reading file /sys/devices/system/cpu/cpu0/regs/identification/midr_el1
I 00:00:00.002664 executorch:cpuinfo_utils.cpp:109] Failed to open midr file /sys/devices/system/cpu/cpu0/regs/identification/midr_el1
I 00:00:00.002671 executorch:cpuinfo_utils.cpp:125] CPU info and manual query on # of cpus dont match.
I 00:00:00.002672 executorch:executor_runner.cpp:223] Resetting threadpool with num threads = 0
I 00:00:00.002722 executorch:executor_runner.cpp:374] Model file /data/users/lfq/security/executorch_repros/TOB-EXECUTORCH-44.pte is loaded.
I 00:00:00.002729 executorch:executor_runner.cpp:384] Using method forward
I 00:00:00.002739 executorch:executor_runner.cpp:435] Setting up planned buffer 0, size 112.
E 00:00:00.002806 executorch:XNNCompiler.cpp:331] Tensor value has unsupported flag bits 0xffffff00
E 00:00:00.002824 executorch:XNNPACKBackend.cpp:122] XNNCompiler::compileModel failed: 0x23
E 00:00:00.002827 executorch:method.cpp:127] Init failed for backend XnnpackBackend: 0x23
F 00:00:00.002830 executorch:executor_runner.cpp:459] In function main(), assert failed (method.ok()): Loading of method forward failed with status 0x23
Aborted (core dumped)
```

After: graceful error
```
(executorch) [lfq@devvm11764.nha0 /data/users/lfq/security/executorch (security44)]$ ./cmake-out/executor_runner --model_path=/data/users/lfq/security/executorch_repros/TOB-EXECUTORCH-44.pte     
Note (XNNPACK): l1_data_cache_bytes=32768, l1_data_cache_line_size=64, l1_data_cache_associativity=8, l1_data_cache_num_sets=64. (init_hardware_config, /data/users/lfq/security/executorch/backends/xnnpack/third-party/XNNPACK/src/configs/hardware-config.c:417)
Note (XNNPACK): l2_data_cache_bytes=1048576, l2_data_cache_line_size=64, l2_data_cache_associativity=8, l2_data_cache_num_sets=2048. (init_hardware_config, /data/users/lfq/security/executorch/backends/xnnpack/third-party/XNNPACK/src/configs/hardware-config.c:436)
I 00:00:00.002562 executorch:cpuinfo_utils.cpp:71] Reading file /sys/devices/soc0/image_version
I 00:00:00.002595 executorch:cpuinfo_utils.cpp:87] Failed to open midr file /sys/devices/soc0/image_version
I 00:00:00.002607 executorch:cpuinfo_utils.cpp:100] Reading file /sys/devices/system/cpu/cpu0/regs/identification/midr_el1
I 00:00:00.002618 executorch:cpuinfo_utils.cpp:109] Failed to open midr file /sys/devices/system/cpu/cpu0/regs/identification/midr_el1
I 00:00:00.002623 executorch:cpuinfo_utils.cpp:125] CPU info and manual query on # of cpus dont match.
I 00:00:00.002628 executorch:executor_runner.cpp:223] Resetting threadpool with num threads = 0
I 00:00:00.002672 executorch:executor_runner.cpp:374] Model file /data/users/lfq/security/executorch_repros/TOB-EXECUTORCH-44.pte is loaded.
I 00:00:00.002678 executorch:executor_runner.cpp:384] Using method forward
I 00:00:00.002688 executorch:executor_runner.cpp:435] Setting up planned buffer 0, size 112.
E 00:00:00.002750 executorch:XNNCompiler.cpp:331] Tensor value has unsupported flag bits 0xffffff00
E 00:00:00.002761 executorch:XNNPACKBackend.cpp:122] XNNCompiler::compileModel failed: 0x23
E 00:00:00.002769 executorch:method.cpp:127] Init failed for backend XnnpackBackend: 0x23
F 00:00:00.002772 executorch:executor_runner.cpp:459] In function main(), assert failed (method.ok()): Loading of method forward failed with status 0x23
```

Co-authored-by: Github Executorch <github_executorch@arm.com>
Co-authored-by: Claude <noreply@anthropic.com>
…M (v1) (pytorch#18859)

The original SmolLM2 PR (pytorch#9354) started as v1 support, was renamed to
`smollm2` during review, but the repo ID and `rope_theta` were never
updated to v2 values. The two checkpoints are genuinely different models
(0/272 tensors match).

- `HUGGING_FACE_REPO_IDS["smollm2"]`: `HuggingFaceTB/SmolLM-135M` →
`HuggingFaceTB/SmolLM2-135M`
- `examples/models/smollm2/135M_config.json`: `rope_theta` `10000.0` →
`100000.0` (matches [SmolLM2-135M HF
config](https://huggingface.co/HuggingFaceTB/SmolLM2-135M/blob/main/config.json))

### Test plan

Data-only change (one string, one number). Verified values match the
upstream HuggingFace SmolLM2-135M config.
Add tryTo accessors for each value type. Previously, `toTensor` and its siblings abort via ET_CHECK_MSG on a type mismatch.

API additions:
- Per-type: tryToInt, tryToDouble, tryToBool, tryToScalar, tryToString,
  tryToTensor (already present, kept), tryToIntList, tryToBoolList,
  tryToDoubleList, tryToTensorList, tryToListOptionalTensor,
  tryToScalarType, tryToMemoryFormat, tryToLayout, tryToDevice.
  Tag mismatch returns Error::InvalidType; null list/string payload
  returns Error::InvalidState.
- Templated tryTo<T>() dispatcher mirroring to<T>(), via a new
  EVALUE_DEFINE_TRY_TO macro kept adjacent to EVALUE_DEFINE_TO so drift
  between the two surfaces is visible at review time.
- tryToOptional<T>() widened from Tensor-only to generic, delegating
  to tryTo<T>() so it works for any supported payload type.

Tests cover success + mismatch paths for each new accessor, plus the
widened tryToOptional<T>() path.
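A Python analogue of the pattern (hypothetical types; the real API is the C++ EValue) shows the contrast between the aborting and recoverable accessors:

```python
# Hedged sketch: a tagged value whose "try" accessor reports an error
# instead of aborting on a tag mismatch, mirroring toInt vs tryToInt.
from dataclasses import dataclass

@dataclass
class EValue:
    tag: str
    payload: object

    def to_int(self):
        # analogous to toInt(): fatal on mismatch
        assert self.tag == "Int", "fatal: tag mismatch"
        return self.payload

    def try_to_int(self):
        # analogous to tryToInt(): recoverable errors
        if self.tag != "Int":
            return None, "InvalidType"
        if self.payload is None:
            return None, "InvalidState"
        return self.payload, None
```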

Authored-with: Claude

---------

Co-authored-by: Github Executorch <github_executorch@arm.com>
…rity (pytorch#18917)

Differential Revision: D99769848

Pull Request resolved: pytorch#18917
…rch#19095)

This PR makes GPU-related operators CUDA-backend specific, to bring the Metal Qwen 3.5 MoE CI back.
Disable fusing of ops that have symbolic shapes as arguments. Also
disable fusing of TOSA dialect ops.

cc @digantdesai @freddan80 @per @zingo @mansnils @Sebastian-Larsson
@robell

Signed-off-by: Oscar Andersson <oscar.andersson@arm.com>
Adds util for computing a value range from a symbolic expression.
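The idea can be sketched with simple interval arithmetic over an expression tree (the representation here is hypothetical; the real util works on symbolic shape expressions):

```python
# Hedged sketch: propagate [min, max] bounds through a small expression
# tree, analogous to deriving a value range for a symbolic expression.
def value_range(expr, bounds):
    """expr: nested tuples, e.g. ("add", ("sym", "x"), ("const", 1))."""
    kind = expr[0]
    if kind == "const":
        return (expr[1], expr[1])
    if kind == "sym":
        return bounds[expr[1]]
    lo_a, hi_a = value_range(expr[1], bounds)
    lo_b, hi_b = value_range(expr[2], bounds)
    if kind == "add":
        return (lo_a + lo_b, hi_a + hi_b)
    if kind == "mul":
        # signs may flip the extremes, so take min/max over all products
        products = [lo_a * lo_b, lo_a * hi_b, hi_a * lo_b, hi_a * hi_b]
        return (min(products), max(products))
    raise ValueError(f"unsupported op: {kind}")
```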

cc @digantdesai @freddan80 @per @zingo @mansnils @Sebastian-Larsson
@robell

Signed-off-by: Oscar Andersson <oscar.andersson@arm.com>
The removed copy appears to be stale; it is never used.
…ch#18088)

## Summary

This PR adds a fused `llama::recurrent_gated_delta_rule` custom op and
wires Qwen3.5 GatedDeltaNet attention to use it instead of the Python
per-token recurrence loop when the op is available.

It also tightens local custom-op loading so we no longer implicitly scan
repo-local `cmake-out*` directories, and adds coverage for
recurrent-state correctness, chunked prefill behavior, and export graph
selection.
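For orientation only, the general shape of a per-token gated delta-rule recurrence the fused op replaces can be sketched as below. This is an illustrative toy, not the semantics of `llama::recurrent_gated_delta_rule`: a rank-1 state update per token, decayed by a gate.

```python
# Illustrative sketch only (not the actual op): per-token recurrence
# over a dim x dim state, updated with a gated delta-rule step.
def gated_delta_rule(keys, values, betas, gates, dim):
    S = [[0.0] * dim for _ in range(dim)]  # recurrent state
    outputs = []
    for k, v, beta, g in zip(keys, values, betas, gates):
        # current prediction of v from the state: S @ k
        pred = [sum(S[i][j] * k[j] for j in range(dim)) for i in range(dim)]
        # gated decay plus rank-1 delta correction toward v
        for i in range(dim):
            for j in range(dim):
                S[i][j] = g * S[i][j] + beta * (v[i] - pred[i]) * k[j]
        outputs.append([sum(S[i][j] * k[j] for j in range(dim)) for i in range(dim)])
    return outputs
```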

## What changed

- added `llama::recurrent_gated_delta_rule` runtime and AOT
registrations
- updated Qwen3.5 GatedDeltaNet attention to use the fused op with
Python fallback preserved
- tightened `custom_ops_aot_lib` discovery:
  - default to package-local discovery
  - allow explicit override via `EXECUTORCH_CUSTOM_OPS_AOT_LIB`
  - removed implicit repo-local `cmake-out*` scanning
- added tests for:
  - recurrent op parity vs reference
  - `.out` variant behavior
  - chunked-state parity vs full-sequence execution
  - custom-op vs fallback attention parity
  - tiny Qwen3.5 export selecting `llama.recurrent_gated_delta_rule`
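The tightened discovery order can be sketched as (function name and library filenames are assumptions; the env-var name is from this PR):

```python
import os
from pathlib import Path

# Hedged sketch of custom-op library discovery: explicit override wins,
# otherwise search only inside the package itself — no implicit
# repo-local cmake-out* scanning.
def find_custom_ops_aot_lib(package_dir):
    override = os.environ.get("EXECUTORCH_CUSTOM_OPS_AOT_LIB")
    if override:
        return override
    for name in ("libcustom_ops_aot_lib.so", "libcustom_ops_aot_lib.dylib"):
        candidate = Path(package_dir) / name
        if candidate.is_file():
            return str(candidate)
    return None
```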

## Validation

### Linux CPU-only (aarch64)

Built `custom_ops_aot_lib` successfully and loaded it via
`EXECUTORCH_CUSTOM_OPS_AOT_LIB`.

Passed:
- `pytest
extension/llm/custom_ops/test_update_cache.py::RecurrentGatedDeltaRuleTest
-q`
  - `3 passed`
- `pytest examples/models/llama/tests/test_qwen3_5_attention.py -q`
  - `7 passed`
- `pytest
examples/models/llama/tests/test_export_llama_lib.py::ExportLlamaLibTest::test_tiny_qwen35_export_uses_recurrent_gated_delta_rule
-q`
  - `1 passed`

### Real-model CPU validation

On a real `Qwen3.5-0.8B` CPU run, fused recurrence matched the fallback
path on next-token selection with very small logit drift, and improved
eager prefill latency on the tested prompt.

Observed on local CPU validation:
- same next token from fused path vs fallback
- max logit diff on the order of `1e-5`
- eager prefill speedup about `1.6x` on the tested prompt

### Windows note

A local Windows-only FFHT/MSVC workaround was used during development to
keep the local build usable, but that workaround is intentionally
**not** included in this PR.

## Non-goals / separate issues

I did not treat the local `program.fbs` serialization issue as part of
this change.

This branch does not modify `exir/_serialize/*` or `schema/program.fbs`,
and serialization-focused checks passed on both this branch and clean
`main` once the local environment was set up correctly.

A separate end-to-end tiny Qwen3.5 `.pte` export probe hit:
- `RuntimeError: Missing out variants: {'aten::alias'}`

That appears to be a separate pre-existing export issue outside this
change set.

cc @larryliu0820 @mergennachin @cccclai @helunwencser @jackzhxng

---------

Co-authored-by: Digant Desai <digantdesai@meta.com>
Co-authored-by: Nikhil Viswanath Sivakumar <68182521+nil-is-all@users.noreply.github.com>
@jgibson2 jgibson2 merged commit 6e6c2a7 into polycam Apr 24, 2026